Towards Recognizing Unseen Categories in Unseen Domains


Massimiliano Mancini1,2[0000-0001-8595-9955], Zeynep Akata2[0000-0002-1432-7747], Elisa Ricci3,4[0000-0002-0228-1147], and Barbara Caputo5,6[0000-0001-7169-0158]

1 Sapienza University of Rome, 2 University of Tübingen, 3 University of Trento, 4 Fondazione Bruno Kessler, 5 Politecnico di Torino, 6 Italian Institute of Technology

mancini@diag.uniroma1.it

Abstract. Current deep visual recognition systems suffer from severe performance degradation when they encounter new images from classes and scenarios unseen during training. Hence, the core challenge of Zero-Shot Learning (ZSL) is to cope with the semantic shift, whereas the main challenge of Domain Adaptation and Domain Generalization (DG) is the domain shift. While historically ZSL and DG have been tackled in isolation, this work pursues the ambitious goal of solving them jointly, i.e. recognizing unseen visual concepts in unseen domains. We present CuMix (Curriculum Mixup for recognizing unseen categories in unseen domains), a holistic algorithm to tackle ZSL, DG and ZSL+DG. The key idea of CuMix is to simulate the test-time domain and semantic shift using images and features from unseen domains and categories, generated by mixing up the multiple source domains and categories available during training. Moreover, a curriculum-based mixing policy is devised to generate increasingly complex training samples. Results on standard ZSL and DG datasets and on ZSL+DG using the DomainNet benchmark demonstrate the effectiveness of our approach.

Keywords: Zero-Shot Learning, Domain Generalization

1 Introduction


Despite their astonishing success in several applications [1,2,3,4], deep visual models perform poorly on classes and scenarios unseen during training. Most existing approaches rest on two assumptions: (a) training and test data come from the same underlying distribution, whose violation causes the domain shift, and (b) the classes seen during training are the only ones that will appear at test time, whose violation causes the semantic shift. These assumptions rarely hold in practice: in the real world, training and test images may not only depict different semantic categories, but also differ significantly in visual appearance.

To address these limitations, research efforts have been devoted to designing deep architectures able to cope with varying visual appearance [7] and with novel semantic concepts [47]. In particular, the domain-shift problem [15] has been addressed by proposing domain adaptation (DA) models [7] that assume the availability of target domain data during training. To circumvent this assumption, a recent trend has been to move to more complex scenarios where the adaptation problem must be tackled either online [14,24] or with the help of target domain descriptions [23], auxiliary data [32] or multiple source domains [25,26,36]. For instance, domain generalization (DG) methods [19,21,5] aim to learn domain-agnostic prediction models that generalize to any unseen target domain.



Fig. 1. Our ZSL+DG problem. During training we have images of multiple categories (e.g. elephant, horse) and domains (e.g. photo, cartoon). At test time, we want to recognize unseen categories (e.g. dog, giraffe), as in ZSL, in unseen domains (e.g. paintings), as in DG, exploiting side information describing seen and unseen categories.


Regarding semantic knowledge, multiple works have designed approaches for extending deep architectures to handle new categories and new tasks. For instance, continual learning methods [18] attempt to learn new tasks sequentially while retaining previous knowledge, tackling the catastrophic forgetting issue. Similarly, in open-world recognition [4] the goal is to detect unseen categories and successfully incorporate them into the model. Another research thread is Zero-Shot Learning (ZSL) [47], where the goal is to recognize objects unseen during training, given external information about the novel classes in the form of semantic attributes [17], visual descriptions [2] or word embeddings [27].

Despite these significant efforts, an open research question is whether we can tackle the two problems jointly. Indeed, due to the large variability of visual concepts in the real world, in terms of both semantics and acquisition conditions, it is impossible to construct a training set capturing such variability. This calls for a holistic approach addressing them together. Consider for instance the case depicted in Fig. 1. A system trained to recognize elephants and horses from realistic images and cartoons might be able to recognize the same categories in another visual domain, like art paintings (Fig. 1, bottom) or it might be able to describe other quadrupeds in the same training visual domains (Fig. 1, top). On the other hand, how to deal with the case where new animals are shown in a new visual domain is not clear.
To our knowledge, our work is the first attempt to answer this question, proposing a method that is able to recognize unseen semantic categories in unseen domains. In particular, our goal is to jointly tackle ZSL and DG (see Fig. 1). ZSL algorithms usually receive as input a set of images with their associated semantic descriptions, and learn the relationship between an image and its semantic attributes. Likewise, DG approaches are trained on multiple source domains and at test time are asked to classify images, assigning labels within the same set of source categories but in an unseen target domain. Here we want to address the scenario where, during training, we are given a set of images of multiple domains and semantic categories, and our goal is to build a model able to recognize images of unseen concepts, as in ZSL, in unseen domains, as in DG.

To achieve this, we need to address challenges usually not present when these two classical tasks, i.e. ZSL and DG, are considered in isolation. For instance, while in DG we can rely on the fact that the multiple source domains allow disentangling semantic and domain-specific information, in ZSL+DG we have no guarantee that this disentanglement will hold for the unseen semantic categories at test time. Moreover, while in ZSL it is reasonable to assume that the learned mapping between images and semantic attributes will generalize to test images of the unseen concepts, in ZSL+DG we have no guarantee that this will happen for images of unseen domains.

To overcome these issues, during training we simulate both the semantic and the domain shift we will encounter at test time. Since explicitly generating images of unseen domains and concepts is an ill-posed problem, we sidestep this issue and synthesize unseen domains and concepts by interpolating existing ones. To do so, we revisit the mixup [53] algorithm as a tool to obtain partially unseen categories and domains. Indeed, by randomly mixing samples of different categories we obtain new samples which do not belong to any single category available during training. Similarly, by mixing samples of different domains, we obtain new samples which do not belong to any single source domain available during training.

Under this perspective, mixing samples across both domains and classes yields samples that cannot be assigned to a single class or domain among those available during training; they are thus novel both in their semantics and in their visual representation. Since higher levels of abstraction contain more task-related information, we perform mixup at both image and feature level, showing experimentally the need for this choice. Moreover, we introduce a curriculum-based mixing strategy to generate increasingly complex training samples. We show that our CuMix (Curriculum Mixup for recognizing unseen categories in unseen domains) model obtains state-of-the-art performance in both ZSL and DG on standard benchmarks, and that it can be effectively applied to the combination of the two tasks, recognizing unseen categories in unseen domains.7


7 The code is available at https://github.com/mancinimassimiliano/CuMix


To summarize, our contributions are as follows. (i) We introduce the ZSL+DG scenario, a first step towards recognizing unseen categories in unseen domains. (ii) We propose the first holistic method able to address ZSL, DG and the two tasks together; it is based on simulating new domains and categories during training by mixing the available training domains and classes at both image and feature level, with a mixing strategy that becomes increasingly more challenging during training, in a curriculum fashion. (iii) Through our extensive evaluations and analysis, we show the effectiveness of our approach in all three settings, namely ZSL, DG and ZSL+DG.

2 Related Works


Domain Generalization (DG). Over the past years the research community has put considerable effort into developing methods to counteract the domain shift. As opposed to domain adaptation [7], where it is assumed that target data are available in the training phase, the key idea behind DG is to learn a domain-agnostic model that can be applied to any unseen target domain.

Previous DG methods can be broadly grouped into four main categories. The first category comprises methods which attempt to learn domain-invariant feature representations [28] by considering specific alignment losses, such as maximum mean discrepancy (MMD), adversarial losses [22] or self-supervised losses [5]. The second category of methods [19,15] develops from the idea of creating deep architectures where both domain-agnostic and domain-specific parameters are learned on the source domains. After training, only the domain-agnostic part is retained and used for processing target data. The third category devises specific optimization strategies or training procedures in order to enhance the generalization ability of the source model to unseen target data. For instance, in [20] a meta-learning approach is proposed for DG. Differently, in [21] an episodic training procedure is presented to learn models robust to the domain shift. The fourth category comprises methods which introduce data and feature augmentation strategies to synthesize novel samples and improve the generalization capability of the learned model [39,43,42]. These strategies are mostly based on adversarial training [39,43].

Our work is related to the latter category, since we also generate synthetic samples with the purpose of learning more robust target models. However, differently from previous methods, we specifically employ mixup to perturb feature representations. Recently, works have considered mixup in the context of domain adaptation [51], e.g. to reinforce the judgments of a domain discriminator. However, we employ mixup from a different perspective, i.e. to simulate the semantic and domain shift we will encounter at test time. To this extent, we are not aware of previous methods using mixup for DG and ZSL.

Zero-Shot Learning (ZSL). Traditional ZSL approaches attempt to learn a projection function mapping images/visual features to a semantic embedding space where classification is performed. This idea is realized by directly predicting image attributes, e.g. [17], or by learning a linear mapping through margin-based objective functions [12]. Other approaches explored the use of non-linear multi-modal embeddings [45], intermediate projection spaces [54,55] or similarity-based interpolation of base classifiers [6]. Recently, various methods tackled ZSL from a generative point of view, considering Generative Adversarial Networks [48], Variational Autoencoders (VAEs) [38] or both [50]. While none of these approaches explicitly tackled the domain shift, i.e. visual appearance changes among different domains/datasets, various methods proposed to use domain adaptation techniques, e.g. to refine the semantic embedding space, aligning semantic and projected visual features [38] or, in transductive scenarios, to cope with the inherent domain shift existing among the appearance of attributes in different categories [16,9,10]. For instance, in [38] a distance between visual and semantic embeddings projected in the VAE latent space is minimized. In [16] the problem is addressed through a regularised sparse coding framework, while in [9] a multi-view hypergraph label propagation framework is introduced.
Recently, works have also considered coupling ZSL and DA in a transductive setting. For instance, in [56] a semantic guided discrepancy measure is employed to cope with the asymmetric label space between source and target domains. In the context of image retrieval, multiple works addressed the sketch-based image retrieval problem [52,8], even across multiple domains. In [40] the authors proposed a method to perform cross-domain image retrieval by training domain-specific experts. While these approaches integrated DA and ZSL, none of them considered the more complex scenario of DG, where no target data are available.

3 Method


In this section, we first formalize the Zero-Shot Learning under Domain Generalization (ZSL+DG) problem. We then describe our approach, CuMix, which, by performing curriculum learning through mixup, simulates the domain- and semantic-shift the network will encounter at test time, and can be holistically applied to ZSL, DG and ZSL+DG.

3.1 Problem Formulation


In the ZSL+DG problem, the goal is to recognize unseen categories (as in ZSL) in unseen domains (as in DG). Formally, let $\mathcal{X}$ denote the input space (e.g. the image space), $\mathcal{Y}$ the set of possible classes and $\mathcal{D}$ the set of possible domains. During training, we are given a set $\mathcal{S}=\{(x_i,y_i,d_i)\}_{i=1}^{n}$ where $x_i\in\mathcal{X}$, $y_i\in\mathcal{Y}^s$ and $d_i\in\mathcal{D}^s$. Note that $\mathcal{Y}^s\subset\mathcal{Y}$ and $\mathcal{D}^s\subset\mathcal{D}$ and, as in standard DG, we have multiple source domains (i.e. $\mathcal{D}^s=\bigcup_{j=1}^{m}d_j$, with $m>1$) with different distributions, i.e. $p_X(x|d_i)\neq p_X(x|d_j)$ $\forall i\neq j$.

Given $\mathcal{S}$, our goal is to learn a function $h$ mapping an image $x$ of a domain in $\mathcal{D}^u\subset\mathcal{D}$ to its corresponding label in a set of classes $\mathcal{Y}^u\subset\mathcal{Y}$. Note that in standard ZSL, while the sets of training and test domains are shared, i.e. $\mathcal{D}^s\equiv\mathcal{D}^u$, the label sets are disjoint, i.e. $\mathcal{Y}^s\cap\mathcal{Y}^u=\emptyset$, thus $\mathcal{Y}^u$ is a set of unseen classes. On the other hand, in DG we have a shared output space, i.e. $\mathcal{Y}^s\equiv\mathcal{Y}^u$, but a disjoint set of domains between training and test, i.e. $\mathcal{D}^s\cap\mathcal{D}^u=\emptyset$, thus $\mathcal{D}^u$ is a set of unseen domains. Since the goal of our work is to recognize unseen classes in unseen domains, we unify the settings of DG and ZSL, considering both semantic- and domain-shift at test time, i.e. $\mathcal{Y}^s\cap\mathcal{Y}^u=\emptyset$ and $\mathcal{D}^s\cap\mathcal{D}^u=\emptyset$.
In the following we divide the function $h$ into three parts: $f$, mapping images into a feature space $\mathcal{Z}$, i.e. $f:\mathcal{X}\rightarrow\mathcal{Z}$; $g$, going from $\mathcal{Z}$ to a semantic embedding space $\mathcal{E}$, i.e. $g:\mathcal{Z}\rightarrow\mathcal{E}$; and an embedding function $\omega:\mathcal{Y}^t\rightarrow\mathcal{E}$, where $\mathcal{Y}^t\equiv\mathcal{Y}^s$ during training and $\mathcal{Y}^t\equiv\mathcal{Y}^u$ at test time. Note that $\omega$ is a learned classifier in DG, while it is a fixed semantic embedding function in ZSL, mapping classes into their vectorized representation extracted from external sources. Given an image $x$, the final class prediction is obtained as follows:

$$y^* = \operatorname*{argmax}_{y\in\mathcal{Y}^t}\ \omega(y)^\top g(f(x)). \tag{1}$$

In this formulation, $f$ can be any learnable feature extractor (e.g. a deep neural network), while $g$ can be any ZSL predictor (e.g. a semantic projection layer, as in [46], or a compatibility function between visual features and labels, as in [12]). A first solution to the ZSL+DG problem could be to train a classifier on the aggregation of data from all source domains. In particular, for each sample we could minimize a loss function of the form:

$$\mathcal{L}_{\mathrm{AGG}}(x_i,y_i)=\ell\big(\{\omega(y)^\top g(f(x_i))\}_{y\in\mathcal{Y}^s},\,y_i\big) \tag{2}$$

with $\ell$ an arbitrary loss function, e.g. the cross-entropy loss. In the following, we show how we can use the input to Eq. (2) to effectively recognize unseen categories in unseen domains.
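To make the prediction rule of Eq. (1) and the aggregated loss of Eq. (2) concrete, here is a minimal NumPy sketch. The toy class embeddings and feature vectors are hypothetical (not from the paper), and cross-entropy stands in for the generic loss $\ell$:

```python
import numpy as np

def predict(z, class_embeddings):
    """Eq. (1): return the class y whose embedding omega(y) has the highest
    compatibility score omega(y)^T z with the projected feature z = g(f(x))."""
    scores = {y: float(emb @ z) for y, emb in class_embeddings.items()}
    return max(scores, key=scores.get)

def agg_loss(z, y_true, class_embeddings, class_order):
    """Eq. (2) with cross-entropy as the loss: softmax over the compatibility
    scores of the seen classes, then negative log-likelihood of y_true."""
    logits = np.array([class_embeddings[y] @ z for y in class_order])
    m = logits.max()                                   # stable log-softmax
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
    return float(-log_probs[class_order.index(y_true)])

# Hypothetical attribute vectors omega(y) for two seen classes.
embeddings = {"horse": np.array([1.0, 0.0]), "elephant": np.array([0.0, 1.0])}
z = np.array([0.9, 0.1])                               # projected feature g(f(x))
print(predict(z, embeddings))                          # -> horse
print(agg_loss(z, "horse", embeddings, ["horse", "elephant"]))
```

Because $\omega$ is a fixed table of embeddings rather than a learned softmax head, swapping in embeddings of unseen classes at test time changes the label space without retraining, which is what Eq. (1) exploits.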

3.2 Simulating Unseen Domains and Concepts through Mixup


The fundamental problem of ZSL+DG is that, during training, we have access neither to visual data associated to the categories in $\mathcal{Y}^u$ nor to data of the unseen domains $\mathcal{D}^u$. One way to overcome this issue in ZSL is to generate samples of unseen classes by learning a generative function conditioned on the semantic embeddings in $\mathcal{W}=\{\omega(y)\,|\,y\in\mathcal{Y}^s\}$ [48,50]. However, since no description is available for the unseen target domain(s) in $\mathcal{D}^u$, this strategy is not feasible in ZSL+DG. On the other hand, previous works on DG proposed to synthesize images of unseen domains through adversarial data augmentation strategies [43,39]. However, these strategies have not been applied to ZSL, since they cannot easily be extended to generate data for the unseen semantic categories $\mathcal{Y}^u$.

To circumvent this issue, we introduce a strategy to simulate, during training, novel domains and semantic concepts by interpolating from the ones available in $\mathcal{D}^s$ and $\mathcal{Y}^s$. Simulating novel domains and classes allows training the network to cope with both semantic- and domain-shift, the same situation our model will face at test time. Since explicitly generating inputs of novel domains and categories is a complex task, in this work we propose to achieve this goal by mixing images and features of different classes and domains, revisiting the popular mixup [53] strategy.



Fig. 2. Our CuMix framework. Given an image (bottom, horse, photo), we randomly sample one image from the same (middle, photo) and one from another (top, cartoon) domain. The samples are mixed through $\phi$ (white blocks) both at image and feature level, with their features and labels projected into the embedding space $\mathcal{E}$ (by $g$ and $\omega$ respectively) and there compared to compute our final objective. Note that $\phi$ varies during training (top part), changing the mixing ratios within and across domains.


In practice, given two elements $a_i$ and $a_j$ of the same space (e.g. $a_i,a_j\in\mathcal{X}$), mixup [53] defines a mixing function $\varphi$ as follows:

$$\varphi(a_i,a_j)=\lambda a_i+(1-\lambda)a_j \tag{3}$$

with $\lambda$ sampled from a beta distribution, i.e. $\lambda\sim\mathrm{Beta}(\beta,\beta)$, with $\beta$ a hyperparameter. Given two samples $(x_i,y_i)$ and $(x_j,y_j)$ randomly drawn from a training set $\mathcal{T}$, a new loss term is defined as:

$$\mathcal{L}_{\mathrm{MIXUP}}((x_i,y_i),(x_j,y_j))=\mathcal{L}_{\mathrm{AGG}}(\varphi(x_i,x_j),\varphi(\bar{y}_i,\bar{y}_j)) \tag{4}$$

where $\bar{y}_i\in\mathbb{R}^{|\mathcal{Y}^s|}$ is the one-hot vectorized representation of label $y_i$. Note that, when mixing two samples and label vectors with $\varphi$, a single $\lambda$ is drawn and applied within $\varphi$ in both image and label spaces. The loss defined in Eq. (4) forces the network to disentangle the various semantic components (i.e. $y_i$ and $y_j$) contained in the mixed inputs (i.e. $x_i$ and $x_j$), plus the ratio $\lambda$ used to mix them. This auxiliary task acts as a strong regularizer that helps the network, e.g., to be more robust against adversarial examples [53]. Note however that the function $\varphi$ creates inputs and targets which do not represent a single semantic concept in $\mathcal{T}$ but contain characteristics taken from multiple samples and categories, synthesizing a new semantic concept from the interpolation of existing ones.
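As a concrete illustration of Eqs. (3)-(4), the following sketch (hypothetical helper names, NumPy for brevity) draws one $\lambda$ and applies it identically to the inputs and to the one-hot labels:

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, num_classes, beta=2.0, rng=None):
    """Eqs. (3)-(4): a single lambda ~ Beta(beta, beta) mixes both the inputs
    and the one-hot label vectors, so the soft target records the exact
    ratio in which the two semantics were combined."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(beta, beta)
    eye = np.eye(num_classes)
    x_mix = lam * x_i + (1 - lam) * x_j              # phi in input space
    y_mix = lam * eye[y_i] + (1 - lam) * eye[y_j]    # phi in label space
    return x_mix, y_mix, lam

x_mix, y_mix, lam = mixup_pair(np.zeros(4), 0, np.ones(4), 1, num_classes=3,
                               rng=np.random.default_rng(0))
# The mixed target is a proper distribution over the two source classes.
print(y_mix, y_mix.sum())
```

The single draw of $\lambda$ per pair is the important detail: input and target stay consistent, so the network can be supervised on how much of each concept the mixed sample contains.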

To recognize unseen concepts in unseen domains at test time, we revisit $\varphi$ to obtain both cross-domain and cross-semantic mixes during training, simulating both semantic- and domain-shift. While simulating the semantic shift is a by-product of the original mixup formulation, here we explicitly revisit $\varphi$ in order to perform cross-domain mixups. In particular, instead of considering a pair of samples from our training set, we consider a triplet $(x_i,y_i,d_i)$, $(x_j,y_j,d_j)$ and $(x_k,y_k,d_k)$. Given $(x_i,y_i,d_i)$, the other two elements of the triplet are randomly sampled from $\mathcal{S}$, with the only constraint that $d_i=d_k$, $i\neq k$ and $d_j\neq d_i$. In this way, the triplet contains two samples of the same domain (i.e. $d_i$) and a third of a different one (i.e. $d_j$). Then, our mixing function $\phi$ is defined as follows:
$$\phi(a_i,a_j,a_k)=\lambda a_i+(1-\lambda)(\gamma a_j+(1-\gamma)a_k) \tag{5}$$

with $\gamma$ sampled from a Bernoulli distribution, $\gamma\sim B(\alpha)$, and $a$ representing either the input $x$ or the vectorized version of the label $y$, i.e. $\bar{y}$. Note that we introduced a term $\gamma$ which allows performing either intra-domain (with $\gamma=0$) or cross-domain (with $\gamma=1$) mixes.
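A literal sketch of the three-way mixing function of Eq. (5), together with the sampling of $\lambda$ and $\gamma$ (the helper names are ours, not from the paper, and the $\beta=0$ guard is our own choice for the warm-up corner case):

```python
import numpy as np

def phi(a_i, a_j, a_k, lam, gamma):
    """Eq. (5): a_i and a_k come from the same domain, a_j from another.
    gamma = 0 gives an intra-domain mix, gamma = 1 a cross-domain mix."""
    return lam * a_i + (1 - lam) * (gamma * a_j + (1 - gamma) * a_k)

def sample_mix_params(alpha, beta, rng=None):
    """lambda ~ Beta(beta, beta), gamma ~ Bernoulli(alpha). Beta(0, 0) is
    undefined, so for beta = 0 we return lam = 1, which makes phi reduce
    to the original sample a_i (no mixing)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(beta, beta) if beta > 0 else 1.0
    gamma = 1.0 if rng.random() < alpha else 0.0
    return lam, gamma

a_i, a_j, a_k = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])
print(phi(a_i, a_j, a_k, lam=0.5, gamma=1.0))   # cross-domain: mixes a_i and a_j
print(phi(a_i, a_j, a_k, lam=0.5, gamma=0.0))   # intra-domain: mixes a_i and a_k
```

Since `phi` is linear in its arguments, the same `(lam, gamma)` pair can be applied to images, to features, and to one-hot labels alike, which is exactly how it is used in Eqs. (6)-(7).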

To learn a feature extractor $f$ and a semantic projection layer $g$ robust to domain- and semantic-shift, we propose to use $\phi$ to simulate both samples and features of novel domains and classes during training. Namely, we simulate the semantic- and domain-shift at two levels, i.e. image and feature level. Given a sample $(x_i,y_i,d_i)\in\mathcal{S}$ we define the following loss:

$$\mathcal{L}_{\text{M-IMG}}(x_i,y_i,d_i)=\mathcal{L}_{\mathrm{AGG}}(\phi(x_i,x_j,x_k),\phi(\bar{y}_i,\bar{y}_j,\bar{y}_k)) \tag{6}$$

where $(x_j,y_j,d_j)$ and $(x_k,y_k,d_k)$ are randomly sampled from $\mathcal{S}$, with $d_i=d_k$ and $d_j\neq d_k$. The loss term in Eq. (6) forces the feature extractor to effectively process inputs of mixed domains/semantics obtained through $\phi$. Additionally, to also act at the classification level, we design another loss which enforces the semantic consistency of mixed features in $\mathcal{E}$. This loss term is defined as:

$$\mathcal{L}_{\text{M-F}}(x_i,y_i,d_i)=\ell\big(\{\omega(y)^\top g(\phi(f(x_i),f(x_j),f(x_k)))\}_{y\in\mathcal{Y}^s},\,\phi(\bar{y}_i,\bar{y}_j,\bar{y}_k)\big) \tag{7}$$

where, as before, $(x_j,y_j,d_j),(x_k,y_k,d_k)\in\mathcal{S}$, with $d_i=d_k$, $i\neq k$ and $d_j\neq d_k$, and $\ell$ is a generic loss function, e.g. the cross-entropy loss. This second loss term forces the classifier $\omega$ and the semantic projection layer $g$ to be robust to features with mixed domains and semantics.

While we could simply use a fixed mixing function $\phi$, as defined in Eq. (5), for Eq. (6) and Eq. (7), we found that it is more beneficial to devise a dynamic $\phi$ which changes its behaviour during training, in a curriculum fashion. Intuitively, minimizing the two objectives defined in Eq. (6) and Eq. (7) requires our model to correctly disentangle the various semantic components used to form the mixed samples. While this is a complex task even for intra-domain mixes (i.e. when only the semantics are mixed), mixing samples across domains makes the task even harder, requiring to isolate also domain-specific factors. To effectively tackle this task, we choose to act on the mixing function $\phi$. In particular, we want $\phi$ to create mixed samples with a progressively increasing degree of mixing with respect to both content and domain, in a curriculum-based fashion.

During training we regulate both $\alpha$ (weighting the probability of cross-domain mixes) and $\beta$ (modifying the probability distribution of the mixing ratio $\lambda$), changing the probability distribution of the mixing ratio $\lambda$ and of the cross-domain factor $\gamma$. In particular, given a warm-up step of $N$ epochs and denoting by $s$ the current epoch, we set $\beta=\min(\frac{s}{N}\beta_{\max},\beta_{\max})$, with $\beta_{\max}$ a hyperparameter, while $\alpha=\max(0,\min(\frac{s-N}{N},1))$. As a consequence, the learning process is made of three phases, with a smooth transition among them. We start by solving the plain classification task on a single domain (i.e. $s<N$, $\alpha=0$, $\beta=\frac{s}{N}\beta_{\max}$). In the subsequent step ($N\leq s<2N$) samples of the same domains are mixed randomly, with possibly different semantics (i.e. $\alpha=\frac{s-N}{N}$, $\beta=\beta_{\max}$). In the third phase ($s\geq 2N$), we mix samples of different domains (i.e. $\alpha=1$), simulating the domain shift the predictor will face at test time. Figure 2 shows a representation of how $\phi$ varies during training (top, white block).
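The three-phase schedule above can be written compactly; this is a direct transcription of the formulas for $\alpha$ and $\beta$ (epoch counter $s$, warm-up length $N$, hyperparameter $\beta_{\max}$):

```python
def curriculum_params(s, n, beta_max):
    """Curriculum schedule for the mixing function phi.

    Phase 1 (s < n):        alpha = 0, beta ramps linearly up to beta_max.
    Phase 2 (n <= s < 2n):  beta = beta_max, alpha ramps linearly up to 1.
    Phase 3 (s >= 2n):      alpha = 1, beta = beta_max (full cross-domain mixing).
    """
    beta = min(s / n * beta_max, beta_max)
    alpha = max(0.0, min((s - n) / n, 1.0))
    return alpha, beta

for s in (0, 5, 10, 15, 20, 30):   # e.g. n = 10 warm-up epochs, beta_max = 2
    print(s, curriculum_params(s, 10, 2.0))
```

The two `min`/`max` clamps make the transitions smooth: each knob ramps linearly during its own warm-up window and then saturates.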
Final objective. The full training procedure is represented in Figure 2. Given a training sample $(x_i,y_i,d_i)$, we randomly draw two other samples, $(x_j,y_j,d_j)$ and $(x_k,y_k,d_k)$, with $d_i=d_k$, $i\neq k$ and $d_j\neq d_i$, feed them to $\phi$ and obtain the first mixed input. We then feed $x_i$, $x_j$, $x_k$ and the mixed sample through $f$ to extract their respective features. At this point we use the features extracted from two other randomly drawn samples (in the figure, just for simplicity, $x_j$ and $x_k$ with the same mixing ratios $\lambda$ and $\gamma$) to obtain the feature-level mixed features needed to build the objective in Eq. (7). Finally, the features of $x_i$ and the two mixed variants at image and feature level are fed to the semantic projection layer $g$, which maps them to the embedding space $\mathcal{E}$. At the same time, the labels in $\mathcal{Y}^s$ are projected into $\mathcal{E}$ through $\omega$. The objectives defined in Eq. (2), Eq. (6) and Eq. (7) are then computed in the semantic embedding space. Our final objective is:

$$\mathcal{L}_{\text{CuMix}}(S) = \frac{1}{|S|} \sum_{(x_i, y_i, d_i) \in S} \mathcal{L}_{\text{AGG}}(x_i, y_i) + \eta_I\, \mathcal{L}_{\text{M-IMG}}(x_i, y_i, d_i) + \eta_F\, \mathcal{L}_{\text{M-F}}(x_i, y_i, d_i) \qquad (8)$$

with ηI and ηF hyperparameters weighting the importance of the two mixing terms. As classification loss inside L_AGG, L_M-IMG and L_M-F we use the standard cross-entropy, even though any ZSL objective could be applied. We highlight that the optimization is performed batch-wise, thus the sampling of the triplet also considers the current batch and not the full training set S. Moreover, while in Figure 2, for simplicity, the same samples are shown to be drawn for L_M-IMG and L_M-F, in practice, given a sample, the random sampling of the other two members of the triplet is carried out twice, once at the image level and once at the feature level. Similarly, the sampling of the mixing ratio λ and of the cross-domain factor γ of ϕ is carried out sample-wise and twice, once at image level and once at feature level. As in Eq. (3), λ and γ are kept fixed across mixed inputs/features and their respective targets in the label space.

Discussion. We now discuss the similarities between our framework and existing DG and ZSL methods. In particular, presenting the classifier with noisy features extracted by a non-domain-specialist network has a goal similar to that of the episodic strategy for DG described in [21]. On the other hand, here we sidestep the need to train domain experts by directly feeding our classifier features of novel domains, obtained by interpolating the available source samples. Our method is also linked to mixup approaches developed in DA [51]. Differently from them, we use mixup to simulate unseen domains rather than to progressively align the source to the given target data.



Fig. 3. ZSL results on CUB, SUN, AWA and FLO datasets with ResNet-101 features.


Our method is also related to ZSL frameworks based on feature generation [48,50]. While the quality of our synthesized samples is lower, since we do not exploit attributes for conditional generation, our computational cost is also lower. In fact, during training we simulate the test-time semantic shift without generating samples of unseen classes. Moreover, we require neither additional training phases on the generated samples nor the availability of unseen-class attributes beforehand.

4 Experimental results


4.1 Datasets and implementation details


We assess CuMix in three scenarios: ZSL, DG and the proposed ZSL+DG setting. ZSL. We conduct experiments on four standard benchmarks: Caltech-UCSD Birds 200-2011 (CUB) [44], SUN attribute (SUN) [31], Animals with Attributes (AWA) [17] and Oxford Flowers (FLO) [29]. CUB contains 11,788 images of 200 bird species with 312 attributes, SUN 14,340 images of 717 scenes annotated with 102 attributes, and AWA 30,475 images of 50 animal categories with 85 attributes. Finally, FLO is a fine-grained dataset of flowers, containing 8,189 images of 102 categories. As semantic representation, we use the visual descriptions of [35], following [48,46]. For each dataset, we use the train, validation and test splits provided by [47]. In all the settings we employ features extracted from the second-to-last layer of a ResNet-101 [13] pretrained on ImageNet as image representation. For CuMix, we consider f as the identity function and g as a simple fully connected layer, performing our version of mixup directly at the feature level while applying our alignment loss in the embedding space. All hyperparameters have been set following [47].
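The ZSL inference pipeline described above can be sketched with numpy; all weights here are random placeholders standing in for trained parameters, and the dimensions (2048-d ResNet-101 features, a hypothetical 300-d embedding space, 5 unseen classes) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim, embed_dim, n_unseen = 2048, 300, 5
W_g = rng.standard_normal((feat_dim, embed_dim)) * 0.01   # the layer g
omega = rng.standard_normal((n_unseen, embed_dim))        # class embeddings

x = rng.standard_normal(feat_dim)   # f(x) = x: pre-extracted ResNet feature
e = x @ W_g                         # project into the embedding space E
scores = omega @ e                  # compatibility with each unseen class
pred = int(np.argmax(scores))       # predicted unseen class index
assert 0 <= pred < n_unseen
```

At test time, only the class embeddings ω change (from seen to unseen classes), while f and g stay fixed.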

DG. We perform experiments on the PACS dataset [19], which contains 9,991 images of 7 semantic classes in 4 different visual domains: art paintings, cartoons, photos and sketches. For this experiment we use the standard train and test splits defined in [19], with the same validation protocol. We use as base architecture a ResNet-18 [13] pretrained on ImageNet. For our model, we consider f to be the ResNet-18 and g the identity function. We use the same training hyperparameters and protocol of [21].

Table 2. Ablation on PACS dataset with ResNet-18 as backbone.

L_AGG   L_M-IMG   L_M-F   Curriculum     Art    Cartoon   Photo   Sketch   Avg.
  ✓        -        -         -          76.1    73.8     94.9    69.4    78.5
  ✓        ✓        -         -          78.4    72.7     94.7    59.5    76.3
  ✓        -        ✓         -          81.8    76.5     94.9    71.2    81.1
  ✓        ✓        ✓         -          82.7    75.4     95.4    71.5    81.2
  ✓        ✓        ✓         ✓          82.3    76.5     95.1    72.6    81.6


Ablation study. In this section, we ablate the various components of our method. We perform the ablation on the PACS benchmark for DG, since this allows us to show how different choices affect the generalization to unseen domains. In particular, we ablate the following implementation choices: 1) mixing samples at the image level, at the feature level, or at both; 2) the impact of our curriculum-based strategy for mixing features and samples.

As shown in Table 2, mixing samples at the feature level produces a clear gain over the baseline, while mixing samples only at the image level can even harm the performance. This happens particularly in the sketch domain, where mixing at the feature level produces a gain of 2% while at the image level we observe a drop of 10% with respect to the baseline. A possible explanation is that mixing samples at the image level produces inputs that are too noisy for the network and not representative of the actual shift experienced at test time. Mixing samples at the feature level instead, after multiple layers of abstraction, better synthesizes the information contained in the different samples, leading to more reliable features for the classifier. Using both yields higher results in almost all domains.

Finally, we analyze the impact of the curriculum-based strategy for mixing samples and features. As the table shows, adding the curriculum strategy boosts the performance in the most difficult cases (i.e. sketches), producing a further accuracy gain. Moreover, applying this strategy stabilizes the training procedure, as demonstrated experimentally.

ZSL+DG. In the proposed ZSL+DG setting we use the DomainNet dataset, training on five out of six domains and reporting the average per-class accuracy on the held-out one. We report the results for all possible target domains but one, i.e. real photos, since our backbone has been pretrained on ImageNet and thus the photo domain is not an unseen one. Since no previous method addressed the ZSL+DG problem, in this work we consider simple baselines derived from the literature of both ZSL and DG. The first baseline is a standard ZSL model without any DG algorithm (i.e. the standard AGG): as the ZSL method we consider SPNet [46]. The second baseline is a DG approach coupled with a ZSL algorithm. To this end, we select the state-of-the-art Epi-FCR as the DG approach, coupling it with SPNet. As a reference, we also evaluate the performance of standard mixup coupled with SPNet.

Table 3. ZSL+DG scenario on the DomainNet dataset with ResNet-50 as backbone.

Method            Clipart   Infograph   Painting   Quickdraw   Sketch   Avg.
SPNet               26.0      16.9        23.8        8.2       21.8    19.4
mixup+SPNet         27.2      16.9        24.7        8.5       21.3    19.7
Epi-FCR+SPNet       26.4      16.7        24.6        9.2       23.2    20.0
CuMix               27.6      17.8        25.5        9.9       22.6    20.7


As shown in Table 3, our method achieves competitive performance in the ZSL+DG setting when compared to a state-of-the-art approach for DG (Epi-FCR) coupled with a state-of-the-art one for ZSL (SPNet), outperforming this baseline in all settings but sketch, and on average by almost 1%. Particularly interesting are the results on the infograph and quickdraw domains. These two domains are the ones where the shift is most evident, as highlighted by the lower results of the baselines. In these scenarios, our model consistently outperforms the competitors, with a remarkable gain of more than 1.5% in average per-class accuracy with respect to the ZSL-only baseline. We also want to highlight that DomainNet is a challenging dataset, where almost all standard DA approaches are ineffective or can even lead to negative transfer [33]. Our method, however, is able to overcome the unseen domain shift at test time, improving over the baselines in all scenarios. Our model also consistently outperforms SPNet coupled with the standard mixup strategy in every scenario. This demonstrates the efficacy of the choices made in CuMix to revisit mixup for recognizing unseen categories in unseen domains.

5 Conclusions


In this work, we propose the novel ZSL+DG scenario. In this setting, during training we are given a set of images from multiple domains and semantic categories, and our goal is to build a model able to recognize unseen concepts, as in ZSL, in unseen domains, as in DG. To solve this problem we design CuMix, the first algorithm that can be holistically and effectively applied to DG, ZSL and ZSL+DG. CuMix is based on simulating inputs and features of new domains and categories during training by mixing the available source domains and classes, both at the image and at the feature level. Experiments on public benchmarks show the effectiveness of CuMix, which achieves state-of-the-art performance in almost all settings of all tasks. Future work will investigate the use of alternative data-augmentation schemes in the ZSL+DG setting.

Acknowledgments. We thank the ELLIS Ph.D. student program and the ERC grants 637076 - RoboExNovo (B.C.) and 853489 - DEXIM (Z.A.). This work has been partially funded by the DFG under Germany's Excellence Strategy, EXC number 2064/1, Project number 390727645.

References


1. Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for attribute-based classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 819-826 (2013)

2. Akata, Z., Reed, S., Walter, D., Lee, H., Schiele, B.: Evaluation of output embeddings for fine-grained image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2927-2936 (2015)

3. Balaji, Y., Sankaranarayanan, S., Chellappa, R.: Metareg: Towards domain generalization using meta-regularization. In: Advances in Neural Information Processing Systems. pp. 998-1008 (2018)

4. Bendale, A., Boult, T.: Towards open world recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1893-1902 (2015)

5. Carlucci, F.M., D'Innocente, A., Bucci, S., Caputo, B., Tommasi, T.: Domain generalization by solving jigsaw puzzles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2229-2238 (2019)

6. Changpinyo, S., Chao, W.L., Gong, B., Sha, F.: Synthesized classifiers for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5327-5336 (2016)

7. Csurka, G.: A Comprehensive Survey on Domain Adaptation for Visual Applications, pp. 1-35. Springer International Publishing (2017)

8. Dutta, A., Akata, Z.: Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5089-5098 (2019)

9. Fu, Y., Hospedales, T.M., Xiang, T., Gong, S.: Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(11), 2332-2345 (2015)

10. Gan, C., Yang, T., Gong, B.: Learning attributes equals multi-source domain generalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 87-97 (2016)

11. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17(1), 2096-2030 (2016)

12. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1440-1448 (2015)

13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770-778 (2016)

14. Hoffman, J., Darrell, T., Saenko, K.: Continuous manifold based adaptation for evolving visual domains. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 867-874 (2014)

15. Khosla, A., Zhou, T., Malisiewicz, T., Efros, A.A., Torralba, A.: Undoing the damage of dataset bias. In: European Conference on Computer Vision. pp. 158-171 (2012)

16. Kodirov, E., Xiang, T., Fu, Z., Gong, S.: Unsupervised domain adaptation for zero-shot learning. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2452-2460 (2015)

17. Lampert, C.H., Nickisch, H., Harmeling, S.: Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(3), 453-465 (2013)

18. Lange, M.D., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G.G., Tuytelaars, T.: Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv:1909.08383 (2019)

19. Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Deeper, broader and artier domain generalization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 5542-5550 (2017)

20. Li, D., Yang, Y., Song, Y.Z., Hospedales, T.M.: Learning to generalize: Meta-learning for domain generalization. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

21. Li, D., Zhang, J., Yang, Y., Liu, C., Song, Y.Z., Hospedales, T.M.: Episodic training for domain generalization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1446-1455 (2019)

22. Li, H., Jialin Pan, S., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5400-5409 (2018)

23. Mancini, M., Bulo, S.R., Caputo, B., Ricci, E.: Adagraph: Unifying predictive and continuous domain adaptation through graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6568-6577 (2019)

24. Mancini, M., Karaoguz, H., Ricci, E., Jensfelt, P., Caputo, B.: Kitting in the wild through online domain adaptation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1103-1109 (2018)

25. Mancini, M., Porzi, L., Bulo, S.R., Caputo, B., Ricci, E.: Inferring latent domains for unsupervised deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)

26. Mancini, M., Porzi, L., Rota Bulò, S., Caputo, B., Ricci, E.: Boosting domain adaptation by discovering latent domains. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3771-3780 (2018)

27. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. 1st International Conference on Learning Representations, Workshop Track Proceedings (2013)

28. Muandet, K., Balduzzi, D., Schölkopf, B.: Domain generalization via invariant feature representation. In: International Conference on Machine Learning. pp. 10-18 (2013)

29. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. pp. 722-729. IEEE (2008)

30. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision. pp. 69-84. Springer (2016)

31. Patterson, G., Hays, J.: SUN attribute database: Discovering, annotating, and recognizing scene attributes. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 2751-2758. IEEE (2012)

32. Peng, K.C., Wu, Z., Ernst, J.: Zero-shot deep domain adaptation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 764-781 (2018)

33. Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1406-1415 (2019)

34. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779-788 (2016)

35. Reed, S., Akata, Z., Lee, H., Schiele, B.: Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 49-58 (2016)

36. Roy, S., Siarohin, A., Sangineto, E., Bulo, S.R., Sebe, N., Ricci, E.: Unsupervised domain adaptation using feature-whitening and consensus loss. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9471-9480 (2019)

37. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211-252 (2015)

38. Schonfeld, E., Ebrahimi, S., Sinha, S., Darrell, T., Akata, Z.: Generalized zero- and few-shot learning via aligned variational autoencoders. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8247-8255 (2019)

39. Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., Sarawagi, S.: Generalizing across domains via cross-gradient training. International Conference on Learning Representations (2018)

40. Thong, W., Mettes, P., Snoek, C.G.: Open cross-domain visual search. arXiv preprint arXiv:1911.08621 (2019)

41. Verma, V.K., Rai, P.: A simple exponential family framework for zero-shot learning. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 792-808. Springer (2017)

42. Volpi, R., Murino, V.: Addressing model vulnerability to distributional shifts over image transformation sets. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 7980-7989 (2019)

43. Volpi, R., Namkoong, H., Sener, O., Duchi, J.C., Murino, V., Savarese, S.: Generalizing to unseen domains via adversarial data augmentation. In: Advances in Neural Information Processing Systems. pp. 5334-5344 (2018)

44. Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200 (2010)

45. Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., Schiele, B.: Latent embeddings for zero-shot classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 69-77 (2016)

46. Xian, Y., Choudhury, S., He, Y., Schiele, B., Akata, Z.: Semantic projection network for zero- and few-label semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8256-8265 (2019)

47. Xian, Y., Lampert, C.H., Schiele, B., Akata, Z.: Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(9), 2251-2265 (2018)

48. Xian, Y., Lorenz, T., Schiele, B., Akata, Z.: Feature generating networks for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5542-5551 (2018)

49. Xian, Y., Schiele, B., Akata, Z.: Zero-shot learning - the good, the bad and the ugly. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4582-4591 (2017)

50. Xian, Y., Sharma, S., Schiele, B., Akata, Z.: f-VAEGAN-D2: A feature generating framework for any-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10275-10284 (2019)

51. Xu, M., Zhang, J., Ni, B., Li, T., Wang, C., Tian, Q., Zhang, W.: Adversarial domain adaptation with domain mixup. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence. pp. 6502-6509. AAAI Press (2020)